Author

Kaushika Potluri

Published

October 16, 2022

Code
knitr::opts_chunk$set(echo = TRUE)

Loading in packages:

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dplyr)
library(stats)

Question 1

Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for Angiography or Bypass surgery?

Angiography

Code
ang_mean <- 18
ang_sd <- 9
ang_ss <- 847

ang_se <- ang_sd/sqrt(ang_ss)

ang_cl <- 0.90  
ang_tail <- (1-ang_cl)/2
ang_tscore <- qt(p = 1-ang_tail, df = ang_ss-1)

ang_ci <- c(ang_mean - ang_tscore * ang_se,
        ang_mean + ang_tscore * ang_se)
print(ang_ci)
[1] 17.49078 18.50922
Code
#assessing Confidence interval
18.50922 - 17.49078
[1] 1.01844

Margin of error

Code
#margin of error = critical t value * standard error (half the interval width)
Margin_of_error_ang <- ang_tscore * ang_se
round(Margin_of_error_ang, 5)
[1] 0.50922

We can be 90% confident that the population mean wait time for the Angiography procedure is between 17.49078 and 18.50922 days, a margin of error of about ±0.51 days.

Bypass

Code
bypass_mean <- 19
bypass_sd <- 10
bypass_ss <- 539

bypass_se <- bypass_sd/sqrt(bypass_ss)

bypass_cl <- 0.90  
bypass_tail <- (1-bypass_cl)/2
bypass_tscore <- qt(p = 1-bypass_tail, df = bypass_ss-1)

bypass_ci <- c(bypass_mean - bypass_tscore * bypass_se,
        bypass_mean + bypass_tscore * bypass_se)
print(bypass_ci)
[1] 18.29029 19.70971

We can be 90% confident that the population mean wait time for the Bypass procedure is between 18.29029 and 19.70971 days.

Code
#assessing Confidence interval
19.70971 - 18.29029
[1] 1.41942

Margin of error

Code
#margin of error = critical t value * standard error (half the interval width)
Margin_of_error_bypass <- bypass_tscore * bypass_se
round(Margin_of_error_bypass, 5)
[1] 0.70971

The Angiography interval is about 1.02 days wide, while the Bypass interval is about 1.42 days wide. Therefore, the confidence interval is narrower for Angiography.
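Both intervals follow the same recipe, so the calculation can be wrapped in a small helper. The following is a minimal sketch: the function name `t_ci` and its argument names are illustrative and not part of the original solution.

```r
# Hypothetical helper: t-based confidence interval from summary statistics.
t_ci <- function(xbar, s, n, conf_level = 0.90) {
  se <- s / sqrt(n)                      # standard error of the mean
  tail <- (1 - conf_level) / 2           # area in each tail
  t_score <- qt(1 - tail, df = n - 1)    # critical t value
  margin <- t_score * se                 # margin of error
  c(lower = xbar - margin, upper = xbar + margin, width = 2 * margin)
}

ang <- t_ci(xbar = 18, s = 9, n = 847)
bypass <- t_ci(xbar = 19, s = 10, n = 539)

# TRUE when the Angiography interval is the narrower of the two
ang[["width"]] < bypass[["width"]]
```

Because the width is twice the margin of error, the comparison depends only on the standard deviation and the sample size: Angiography has both a smaller standard deviation and a larger sample.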

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.

Code
#n = sample size (adults surveyed), x = number of "successes" in the sample
n <- 1031
x <- 567 #(believed that college education is essential for success)

#Using prop.test to find p (the confidence level is 95% by default)
#This function returns the point estimate and its 95% confidence interval.
prop.test(x, n)

    1-sample proportions test with continuity correction

data:  x out of n, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515 

The point estimate of the proportion of adult Americans who believe a college education is essential for success is p = 0.5499515, or about 55%. The 95% confidence interval is [0.5189682, 0.5805580]: we can be 95% confident that the true population proportion lies between about 51.9% and 58.1%.
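As a cross-check, the interval can also be computed by hand with the normal-approximation (Wald) formula, p ± z * sqrt(p(1 - p)/n). Note that `prop.test()` builds its interval by inverting the score test with a continuity correction, so the two results should be close but not identical. The variable names below are illustrative.

```r
n <- 1031                                 # sample size
x <- 567                                  # respondents who agreed

p_hat <- x / n                            # point estimate of the proportion
se <- sqrt(p_hat * (1 - p_hat) / n)       # standard error of p_hat
z <- qnorm(0.975)                         # critical value for 95% confidence

wald_ci <- c(p_hat - z * se, p_hat + z * se)
round(wald_ci, 4)
```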

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per semester for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range (in other words, you can assume they know the population standard deviation). Assuming the significance level to be 5%, what should be the size of the sample?

Code
#Evaluating standard deviation using the given values
UMassSD <- (200-30)/4
UMassSD
[1] 42.5

Since the significance level is 5%, the confidence level is 95%, which corresponds to a z-score of 1.96. With this, the required sample size can be calculated as n = (z * sd / margin of error)^2.

Code
#samplesize = ((UMassSD * zscore)/5)^2
samplesize <- ((UMassSD * 1.96)/5)^2
print(samplesize)
[1] 277.5556

Rounding up to the next whole number, the required sample size is 278.
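The same calculation can be written as a reusable function that takes the critical value from `qnorm()` instead of the rounded 1.96 and rounds up automatically. This is a sketch only: `sample_size_mean` and its arguments are hypothetical names introduced here.

```r
# Hypothetical helper: sample size needed to estimate a mean to within
# `margin` when the population standard deviation `sigma` is assumed known.
sample_size_mean <- function(sigma, margin, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)    # two-sided critical z value
  ceiling((z * sigma / margin)^2)         # always round up
}

sigma_guess <- (200 - 30) / 4             # a quarter of the stated range
sample_size_mean(sigma = sigma_guess, margin = 5)
```

Rounding up rather than to the nearest integer guarantees that the margin of error does not exceed the $5 target.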

Question 4

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

  a) Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

We assume that the sample is random and that the population is normally distributed.

Null hypothesis: H0: μ = 500

Alternative hypothesis: Ha: μ ≠ 500

We will reject the null hypothesis at a p-value less than or equal to 0.05

Code
p_mean <- 500
s_meanfemale <- 410
s_sizefemale = 9
sd = 90

#find standard error
standarderrorfemale<- sd/sqrt(s_sizefemale)
standarderrorfemale
[1] 30
Code
#calculating t-score
t_stat<- (s_meanfemale-p_mean)/standarderrorfemale
t_stat
[1] -3
Code
#calculating the two-sided p value
df <- 9-1
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value
[1] 0.01707168

Since the p-value (0.017) is less than 0.05, we reject the null hypothesis: the mean income of female employees differs from $500 per week.

  b) Report the P-value for Ha: μ < 500. Interpret.

We assume that the sample is random and that the population is normally distributed.

Null hypothesis: H0: μ = 500

Alternative hypothesis: Ha: μ < 500

We will reject the null hypothesis at a p-value less than or equal to 0.05

Code
#lower-tail probability P(T <= t_stat), the p-value for Ha: mu < 500
pvalue_lower <- pt(t_stat, df, lower.tail = TRUE)
pvalue_lower
[1] 0.008535841

As the p-value (0.0085) is less than 0.05, we reject the null hypothesis. The data support the conclusion that the mean income of female employees is less than $500 per week.

  c) Report and interpret the P-value for Ha: μ > 500.
Code
pvalue_upper <- pt(t_stat, df, lower.tail = FALSE)
pvalue_upper
[1] 0.9914642

As the p-value (0.991) is far greater than 0.05, we fail to reject the null hypothesis. There is no evidence that the mean income of female employees is greater than $500 per week.

Code
#checking if sum = 1
pvalue_upper + pvalue_lower
[1] 1
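The three p-values above come from the same t statistic and differ only in which tail of the t distribution is used. A compact way to see this is a helper that returns all three at once. This is a minimal sketch: `t_test_summary` and its argument names are hypothetical.

```r
# Hypothetical helper: one-sample t-test from summary statistics.
t_test_summary <- function(xbar, s, n, mu0) {
  se <- s / sqrt(n)                       # standard error of the mean
  t_stat <- (xbar - mu0) / se             # test statistic
  dof <- n - 1                            # degrees of freedom
  c(
    t = t_stat,
    p_two_sided = 2 * pt(-abs(t_stat), dof),            # Ha: mu != mu0
    p_less = pt(t_stat, dof),                           # Ha: mu <  mu0
    p_greater = pt(t_stat, dof, lower.tail = FALSE)     # Ha: mu >  mu0
  )
}

res <- t_test_summary(xbar = 410, s = 90, n = 9, mu0 = 500)
round(res, 4)
```

The two one-sided p-values always sum to 1, and the two-sided p-value is twice the smaller of them.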

Question 5

Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.

  a) Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

Jones

Code
#first calculating t-value for Jones
t_stat_Jones <- (519.5 - 500)/(10)
t_stat_Jones
[1] 1.95
Code
df <- 1000-1
#now we calculate p value for Jones


p_value_Jones <- 2*pt(t_stat_Jones,df, lower.tail = FALSE)
p_value_Jones
[1] 0.05145555

Smith

Code
#first calculating t-value for Smith
t_stat_Smith <- (519.7 - 500)/(10)
t_stat_Smith
[1] 1.97
Code
df <- 1000-1

#now we calculate p value for Smith
p_value_Smith <- 2*pt(t_stat_Smith,df, lower.tail = FALSE)
p_value_Smith
[1] 0.04911426

b)Using α = 0.05, for each study indicate whether the result is “statistically significant.”

Answer : A result is ‘statistically significant’ when the p-value is smaller than the significance level, here 0.05. For Jones, the p-value is 0.051, which is greater than 0.05. The result is therefore not statistically significant, and we cannot reject the null hypothesis.

For Smith, the p-value is 0.049 which is smaller than the significance level. This means it is statistically significant and that we can reject the null hypothesis in favor of the alternative hypothesis.

  c) Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0,” without reporting the actual P-value.

Answer : Without the actual P-value, a reader cannot tell how close the result was to the significance threshold. Jones’s P-value is barely above 0.05 and Smith’s is barely below it, so two studies with nearly identical samples end up reporting opposite decisions about the null hypothesis. Reporting only “reject” or “do not reject” hides how similar the evidence is and could lead readers to very different conclusions.
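One way to make the similarity of the two studies visible is to compare their 95% confidence intervals instead of the reject/do-not-reject decisions. The sketch below uses only the summary values given in the question; the variable names are illustrative.

```r
se <- 10                                  # standard error reported by both
dof <- 1000 - 1                           # degrees of freedom
t_crit <- qt(0.975, df = dof)             # critical value for a 95% interval

ci_jones <- 519.5 + c(-1, 1) * t_crit * se
ci_smith <- 519.7 + c(-1, 1) * t_crit * se

# One row per study: lower and upper limits
round(rbind(Jones = ci_jones, Smith = ci_smith), 2)
```

The two intervals have the same width and are shifted by only 0.2, yet the value 500 falls just inside one and just outside the other.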

Question 6

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.

Answer :

Code
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
Code
#Mean of taxes
Mean_gastaxes <- mean(gas_taxes)
Mean_gastaxes
[1] 40.86278
Code
t.test(gas_taxes, mu = 45, alternative = 'less')

    One Sample t-test

data:  gas_taxes
t = -1.8857, df = 17, p-value = 0.03827
alternative hypothesis: true mean is less than 45
95 percent confidence interval:
     -Inf 44.67946
sample estimates:
mean of x 
 40.86278 

The p-value is 0.038, which is less than the 5% significance level. We therefore reject the null hypothesis that the average tax per gallon was greater than or equal to 45 cents. At the 5% significance level, there is enough evidence to conclude that the average tax per gallon of gas in the US in 2005 was less than 45 cents.
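The `t.test()` result can be reproduced by hand from the sample mean and standard deviation, which makes explicit where the test statistic and the one-sided p-value come from. This is a sketch of that check; the names `t_manual` and `p_manual` are introduced here for illustration.

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

n <- length(gas_taxes)
se <- sd(gas_taxes) / sqrt(n)             # standard error of the mean
t_manual <- (mean(gas_taxes) - 45) / se   # test statistic for H0: mu = 45
p_manual <- pt(t_manual, df = n - 1)      # lower-tail p-value for Ha: mu < 45

c(t = t_manual, p_value = p_manual)
```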